Spotify EDA of Personal Music Taste¶

Import Dependencies¶

In [1]:
# import dependencies for authentication
import os
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.oauth2 import SpotifyOAuth
from dotenv import load_dotenv 
In [2]:
# import dependencies for eda
import numpy as np
import pandas as pd
import altair as alt

import seaborn as sns
import plotly.express as px 
import matplotlib.pyplot as plt
%matplotlib inline
    
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances
from scipy.spatial.distance import cdist

import warnings
warnings.filterwarnings("ignore")

EDA Begins¶

In [3]:
# read in liked songs
liked_songs = pd.read_csv('liked_songs.csv', index_col = [0])

Basic Details¶

In [4]:
liked_songs.shape
Out[4]:
(4812, 23)

There are 4812 rows and 23 columns in this dataset. That's a lot of songs!

This already brings up some questions:

  1. What do all of these columns mean?
  2. What is the range of these columns?
  3. How many songs did I add per year?
  4. Am I more likely to add songs at a particular time of day/on a particular day/month/year?
  5. How many songs do I have per artist?
  6. Which genre/artist do I listen to the most (have I added the most songs for)?
  7. Do the songs I listen to tend to be more popular or less?
  8. How many songs do I listen to that have the same name?
  9. Is there a pattern to the audio features?

We've got some gears turning here. Let's start with the first two. We can use liked_songs.describe() and the Spotify API dashboard to create a table of our column names and a brief description of each.
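
As a rough sketch of what such a description table could look like: the descriptions here are paraphrased from memory of the Spotify Web API audio-features reference, so treat them as assumptions and check the official docs for exact wording. The names `column_notes` and `column_table` are made up for this illustration.

```python
import pandas as pd

# Descriptions are paraphrased assumptions, not quotes from the Spotify docs.
column_notes = {
    "popularity": "0-100 score, higher means more popular",
    "acousticness": "0-1 confidence that the track is acoustic",
    "speechiness": "0-1 presence of spoken words",
    "key": "pitch class as an integer, 0 = C, 1 = C#/Db, ..., 11 = B",
    "liveness": "0-1 likelihood that the track was recorded live",
    "instrumentalness": "0-1 likelihood that the track has no vocals",
    "energy": "0-1 perceived intensity and activity",
    "tempo": "estimated beats per minute",
    "time_signature": "estimated beats per bar",
    "loudness": "average loudness in decibels (typically negative)",
    "danceability": "0-1 suitability for dancing",
    "valence": "0-1 musical positiveness (1 = happy, 0 = sad)",
}

# turn the dict into a two-column table: column name, description
column_table = (
    pd.Series(column_notes, name="description")
    .rename_axis("column")
    .reset_index()
)
print(column_table.to_string(index=False))
```

Columns such as duration_s, genre, and pitch_class look like they were derived during preprocessing rather than returned by the API directly, so they are left out of this sketch.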

In [5]:
# get column count, mean, std, etc.
liked_songs.describe()
Out[5]:
duration_s popularity acousticness speechiness key liveness instrumentalness energy tempo time_signature loudness danceability valence year month day
count 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000 4812.000000
mean 212.061016 44.401288 0.391558 0.075203 5.054239 0.170380 0.031551 0.528759 119.505147 3.892352 -8.009924 0.576867 0.438185 2020.284289 7.111180 15.619077
std 53.039558 24.800363 0.337174 0.083933 3.517479 0.128426 0.146121 0.234326 30.490383 0.410889 4.124780 0.149988 0.226988 1.189322 3.717848 9.311429
min 34.050000 0.000000 0.000006 0.022500 0.000000 0.019300 0.000000 0.000280 39.120000 1.000000 -39.995000 0.067600 0.029100 2017.000000 1.000000 1.000000
25% 182.250500 29.000000 0.060300 0.033800 2.000000 0.097400 0.000000 0.348000 95.009750 4.000000 -9.730000 0.479000 0.254000 2020.000000 4.000000 7.000000
50% 207.762000 48.000000 0.309000 0.044400 5.000000 0.118000 0.000001 0.530000 118.065000 4.000000 -7.156500 0.583000 0.412000 2020.000000 7.000000 15.000000
75% 236.945250 64.000000 0.723000 0.074000 8.000000 0.193000 0.000151 0.715250 140.038250 4.000000 -5.289000 0.684000 0.607000 2021.000000 11.000000 24.000000
max 780.439000 100.000000 0.996000 0.929000 11.000000 0.976000 0.973000 0.986000 218.365000 5.000000 -1.148000 0.981000 0.974000 2022.000000 12.000000 31.000000

We can make a few observations here:

The average popularity of my songs is 44.40. With a popularity score range of 0 to 100, that's not very high at all! We should compare this with the distribution of popularity and the median.

My average song tempo is 119.51 BPM, which is classified as a moderate tempo.
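
A minimal sketch of the mean-versus-median comparison mentioned above. The helper name `center_summary` and the toy scores are made up for illustration and are not taken from liked_songs.

```python
import pandas as pd

def center_summary(values: pd.Series) -> pd.Series:
    """Return the mean, the median, and their difference for a numeric column."""
    mean, median = values.mean(), values.median()
    return pd.Series(
        {"mean": mean, "median": median, "mean_minus_median": mean - median}
    )

# toy scores with a couple of zeros, NOT real data
toy = pd.Series([0, 0, 40, 50, 60])
summary = center_summary(toy)
print(summary)
```

On the real data this would be called as `center_summary(liked_songs["popularity"])`. The describe() table already hints at the result: the median (48) sits above the mean (44.40), which is what we would expect if a group of very low scores pulls the mean down.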

In [6]:
# print the data type of each column
liked_songs.dtypes
Out[6]:
id                   object
name                 object
artists              object
duration_s          float64
popularity            int64
added_at             object
acousticness        float64
speechiness         float64
key                   int64
liveness            float64
instrumentalness    float64
energy              float64
tempo               float64
time_signature        int64
loudness            float64
danceability        float64
valence             float64
genre_list           object
genre                object
pitch_class          object
year                  int64
month                 int64
day                   int64
dtype: object

Let's find the range of my dataset:

In [7]:
# print the first and last date in the dataset
print("The first date I added a song to my `liked` list was " + str(liked_songs.added_at.min()))
print("The last date I added a song to my `liked` list was " + str(liked_songs.added_at.max()))
The first date I added a song to my `liked` list was 2017-03-03 19:50:46+00:00
The last date I added a song to my `liked` list was 2022-12-31 18:17:00+00:00
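
Question 4 asks whether I add songs at particular times. One possible sketch, assuming `added_at` holds ISO-style timestamp strings like the ones printed above: parse them and pull out the weekday and hour. The helper name `add_time_parts` is hypothetical, and the three sample timestamps are copied from rows shown in this notebook.

```python
import pandas as pd

def add_time_parts(df: pd.DataFrame, ts_col: str = "added_at") -> pd.DataFrame:
    """Return a copy of df with weekday and hour (UTC) columns added."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_col], utc=True)  # strings -> tz-aware timestamps
    out["weekday"] = ts.dt.day_name()
    out["hour_utc"] = ts.dt.hour
    return out

toy = pd.DataFrame({
    "added_at": [
        "2022-12-31 18:17:00+00:00",
        "2022-12-26 20:38:34+00:00",
        "2022-10-20 04:40:59+00:00",
    ]
})
parts = add_time_parts(toy)
print(parts[["weekday", "hour_utc"]])
```

Note that the timestamps are in UTC, so the hour would need a timezone conversion before it says anything about local time of day. From here, `parts["weekday"].value_counts()` would give the per-weekday counts.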

Now, let's find how many songs I've added to my Liked Songs per year.

In [8]:
# how many songs per year? 
liked_songs.year.value_counts()
Out[8]:
2020    2111
2022     937
2021     840
2019     710
2017     208
2018       6
Name: year, dtype: int64

Now we'll address which artists I listen to the most.

In [9]:
# my top 25 most-added artists overall
top_songs = liked_songs.artists.value_counts()[:25] 
top_songs = top_songs.reset_index()
top_songs.columns = ['artist', 'number_of_songs']

ax = sns.barplot(x = 'number_of_songs', y = 'artist', data = top_songs)
# ax.tick_params(axis='x', rotation=90)

Taylor Swift and Ed Sheeran have the greatest number of songs in my Liked Songs dataset. It would be interesting to see whether this is also true for this year, and to compare the number of songs I have saved with the total number of songs by each artist.

In [10]:
# my top 25 most-added artists for 2022
top_songs_2022 =  liked_songs[liked_songs.year == 2022]['artists'].value_counts()[:25]
top_songs_2022 = top_songs_2022.reset_index()
top_songs_2022.columns = ['artist', 'number_of_songs']

ax = sns.barplot(x = 'number_of_songs', y = 'artist', data = top_songs_2022)
# ax.tick_params(axis='x', rotation=90)

Overall, I've added the most Taylor Swift, but in 2022, I've added the most 5 Seconds of Summer. It's also interesting to note that Ed Sheeran isn't in my top 10 for 2022, but he is in my top 10 overall.

In [11]:
# count of genres / what genres I listen to most often
liked_songs.genre.value_counts()[:20]
Out[11]:
dance pop                    777
pop                          692
alt z                        343
other                        329
boy band                     224
acoustic pop                 144
canadian pop                 125
canadian contemporary r&b    119
folk-pop                      94
bedroom pop                   92
modern rock                   74
indie folk                    69
neo mellow                    69
nyc pop                       56
art pop                       54
indie pop                     53
alternative r&b               53
adult standards               49
british soul                  48
post-teen pop                 44
Name: genre, dtype: int64

There's a lot of pop in there, which is completely expected. I can piece some of the other genres together, but alt z? Let's take a look at what kinds of songs are deemed alt z.

In [12]:
liked_songs[liked_songs.genre == 'alt z'].head(5)
Out[12]:
id name artists duration_s popularity added_at acousticness speechiness key liveness ... time_signature loudness danceability valence genre_list genre pitch_class year month day
1 2VFetGqLYq0Pc8ZtRYCaeL In The Kitchen Reneé Rapp 227.749 73 2022-12-31 08:05:02+00:00 0.833 0.0504 0 0.2640 ... 3 -6.578 0.364 0.125 ['alt z', 'indie pop', 'pop'] alt z C/Do 2022 12 31
2 2nZq5WQOW4FEPxCVTdNGfB Kissin' In The Cold JP Saxe 224.083 73 2022-12-26 20:38:34+00:00 0.777 0.0360 0 0.1620 ... 4 -7.620 0.528 0.342 ['alt z', 'canadian contemporary r&b', 'neo me... alt z C/Do 2022 12 26
55 0eQJy4VAW7AkhKIHzXx3jG The First One Astrid S 188.649 44 2022-10-20 04:40:59+00:00 0.390 0.1600 7 0.1030 ... 4 -8.304 0.538 0.183 ['alt z', 'dance pop', 'norwegian pop', 'pop',... alt z G/Sol 2022 10 20
56 46ydq5g3k17iLJs3qMDvO6 Hurts So Good Astrid S 208.728 73 2022-10-20 04:40:55+00:00 0.084 0.0586 7 0.0957 ... 4 -5.027 0.675 0.378 ['alt z', 'dance pop', 'norwegian pop', 'pop',... alt z G/Sol 2022 10 20
109 6BgOYYhN3yzY3GzaUv3b7T Fun While It Lasted Ashe 146.436 46 2022-10-19 06:27:33+00:00 0.834 0.0316 9 0.1460 ... 4 -7.590 0.494 0.219 ['alt z', 'pop'] alt z A/La 2022 10 19

5 rows × 23 columns

Let's check the popularity of the music I listen to.

In [13]:
liked_songs[['added_at','name', 'artists', 'popularity', 'genre']].sort_values('popularity', ascending=False)[:20]
Out[13]:
added_at name artists popularity genre
217 2022-09-23 05:42:14+00:00 Unholy (feat. Kim Petras) Sam Smith 100 pop
4288 2019-12-25 08:39:13+00:00 All I Want for Christmas Is You Mariah Carey 99 dance pop
46 2022-10-21 16:15:23+00:00 Anti-Hero Taylor Swift 97 pop
162 2022-10-08 03:32:21+00:00 I Ain't Worried OneRepublic 95 piano rock
0 2022-12-31 18:17:00+00:00 Kill Bill SZA 95 pop
825 2022-04-01 00:18:10+00:00 As It Was Harry Styles 94 pop
2324 2020-10-01 00:25:25+00:00 Another Love Tom Odell 94 chill pop
643 2022-05-20 04:14:23+00:00 As It Was Harry Styles 93 pop
4283 2019-12-25 08:39:25+00:00 Snowman Sia 93 australian dance
4280 2019-12-25 08:39:32+00:00 It's Beginning to Look a Lot like Christmas Michael Bublé 92 adult standards
4287 2019-12-25 08:39:16+00:00 Mistletoe Justin Bieber 92 canadian pop
2425 2020-09-10 00:43:39+00:00 Blinding Lights The Weeknd 92 canadian contemporary r&b
565 2022-05-24 05:26:10+00:00 Late Night Talking Harry Styles 91 pop
388 2022-07-17 01:45:35+00:00 Bad Habit Steve Lacy 90 afrofuturism
4720 2017-03-19 21:59:23+00:00 Yellow Coldplay 90 permanent wave
42 2022-10-22 04:40:42+00:00 Midnight Rain Taylor Swift 89 pop
2890 2020-05-24 16:43:30+00:00 Heather Conan Gray 89 bedroom pop
45 2022-10-21 17:21:03+00:00 Lavender Haze Taylor Swift 89 pop
4758 2017-03-11 23:48:38+00:00 Perfect Ed Sheeran 89 pop
22 2022-11-24 02:43:16+00:00 Something in the Orange Zach Bryan 89 oklahoma country

A couple of names stand out here: Taylor Swift and Harry Styles each appear three times, and both rank highly in general popularity as well as TikTok popularity. Most of the top 20 songs in my dataset by popularity are pop or a pop subgenre as well.

That goes for Sam Smith's song 'Unholy' as well which has a popularity score of 100.

In [14]:
len(liked_songs[liked_songs.popularity > 79])*100/len(liked_songs)
Out[14]:
4.530340814630091

Only 4.53% of the songs in my Liked Songs have a popularity score above 79.

In [15]:
sns.histplot(data=liked_songs, x="popularity", color='lavender');

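Here is the same histogram restricted to the songs I added in 2022. (I was also curious about the number of songs I saved that have the same name; we'll come back to that below.)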

In [16]:
sns.histplot(data=liked_songs[liked_songs.year == 2022], x="popularity", color = 'pink');

The popularity distribution of my songs is very roughly normal, barring the songs with a popularity score of 0. Let's look at the songs that have 0 popularity.

In [17]:
# zero popularity songs
zero_pop = liked_songs[liked_songs.popularity == 0]
zero_pop.head(10)
Out[17]:
id name artists duration_s popularity added_at acousticness speechiness key liveness ... time_signature loudness danceability valence genre_list genre pitch_class year month day
335 6rbeWjEavBHvX2kr6lSogS Work Out J. Cole 234.773 0 2022-07-17 03:16:52+00:00 0.05310 0.1060 2 0.3160 ... 4 -6.903 0.831 0.216 ['conscious hip hop', 'hip hop', 'north caroli... conscious hip hop D/Re 2022 7 17
337 6HHytHLXLX8QfWTtGfFSvH Drop It Like It's Hot Snoop Dogg 266.066 0 2022-07-17 03:16:43+00:00 0.16900 0.2160 1 0.1020 ... 4 -3.832 0.892 0.676 ['g funk', 'gangster rap', 'hip hop', 'pop rap... g funk C#/Do sost. 2022 7 17
338 2mpFm3f7QmdsVtSAIICEB7 Candy Shop 50 Cent 208.533 0 2022-07-17 03:16:41+00:00 0.03050 0.4810 7 0.3690 ... 5 -7.992 0.609 0.797 ['east coast hip hop', 'gangster rap', 'hip ho... east coast hip hop G/Sol 2022 7 17
346 2ng5lhEsASy6jgbOLg877a Gas Pedal Sage The Gemini 208.160 0 2022-07-17 03:15:53+00:00 0.02720 0.0576 1 0.1230 ... 4 -8.059 0.846 0.441 ['dance pop', 'hyphy', 'pop rap', 'rap', 'sout... dance pop C#/Do sost. 2022 7 17
356 06iMqWThw4w8fTFyccvOwr Ride Wit Me Nelly 291.781 0 2022-07-17 03:15:08+00:00 0.06680 0.0479 7 0.2470 ... 4 -6.625 0.854 0.753 ['canadian latin', 'canadian pop', 'dance pop'... canadian latin G/Sol 2022 7 17
366 2IpGdrWvIZipmaxo1YRxw5 Bottoms Up (feat. Nicki Minaj) Trey Songz 242.013 0 2022-07-17 03:12:30+00:00 0.02050 0.1610 1 0.3850 ... 4 -5.283 0.845 0.329 ['dance pop', 'hip pop', 'pop', 'r&b', 'southe... dance pop C#/Do sost. 2022 7 17
367 6C7RJEIUDqKkJRZVWdkfkH Stronger Kanye West 311.866 0 2022-07-17 03:12:21+00:00 0.00728 0.1550 10 0.3180 ... 4 -7.731 0.625 0.483 ['chicago rap', 'rap'] chicago rap A#/La sost. 2022 7 17
371 2aIdVb8v9KTpEZnftkz2mD Buy U a Drank (Shawty Snappin') (feat. Yung Joc) T-Pain 227.960 0 2022-07-17 03:11:54+00:00 0.01080 0.2620 1 0.0737 ... 4 -8.137 0.451 0.594 ['dance pop', 'hip hop', 'pop rap', 'r&b', 'ra... dance pop C#/Do sost. 2022 7 17
424 4gbC4p3lbOS04ZY0NMoNd8 Love Will Never Be the Same Gabriel Bernini 228.739 0 2022-07-04 07:14:44+00:00 0.32200 0.0383 0 0.2020 ... 4 -14.522 0.673 0.964 ['other'] other C/Do 2022 7 4
487 2ubI6GBwah4uiNYKI3gQFE Too Many Lovers Gabriel Bernini 181.649 0 2022-06-06 00:22:37+00:00 0.27000 0.0702 4 0.1380 ... 4 -11.823 0.885 0.875 ['other'] other E/Mi 2022 6 6

10 rows × 23 columns

Many of these seem to be older releases, or what would be considered 'party' songs.

In [18]:
# looking at the name frequency of the songs I've added
liked_songs.name.value_counts()[:20]
Out[18]:
She              6
Daylight         6
Memories         5
Trouble          5
Hurricane        5
Boyfriend        5
One              4
Somebody Else    4
Enchanted        4
Shapeshifter     4
Invisible        4
Wolves           4
Ghosts           4
Golden           4
Waiting          4
Runaway          4
I'm Yours        4
The City         4
Roses            4
Colors           4
Name: name, dtype: int64

Does this say anything about the types of songs I enjoy listening to?

In [19]:
liked_songs[liked_songs.name == "She"]
Out[19]:
id name artists duration_s popularity added_at acousticness speechiness key liveness ... time_signature loudness danceability valence genre_list genre pitch_class year month day
1552 56XCgjSYaLg0TjxkVSpqNu She Ed Sheeran 244.653 37 2021-03-31 03:24:22+00:00 0.811000 0.0361 9 0.1300 ... 3 -10.910 0.573 0.3770 ['pop', 'uk pop'] pop A/La 2021 3 31
2000 63o0CBXZ6LAnJoqz26oRj9 She Meyru 334.608 0 2020-11-04 04:17:04+00:00 0.533000 0.0348 11 0.3400 ... 4 -9.908 0.569 0.0408 ['other'] other B/Si 2020 11 4
2549 1z1ztKUrDr09ZSMDnN3QIG She Selena Gomez 172.999 56 2020-08-28 02:39:22+00:00 0.277000 0.0481 2 0.0672 ... 4 -6.686 0.783 0.7230 ['dance pop', 'pop', 'post-teen pop'] dance pop D/Re 2020 8 28
2951 5hwzzutxeqeC5VMSpnfgul She Selena Gomez 172.999 27 2020-05-06 00:27:17+00:00 0.286000 0.0485 2 0.0682 ... 4 -6.687 0.784 0.7280 ['dance pop', 'pop', 'post-teen pop'] dance pop D/Re 2020 5 6
3375 6SQLk9HSNketfgs2AyIiMs She Harry Styles 362.653 75 2020-03-23 01:37:17+00:00 0.000532 0.0272 0 0.1900 ... 3 -5.942 0.535 0.4570 ['pop'] pop C/Do 2020 3 23
4770 3ICdPHubhqTJ4Lm9NEb2W3 She Ed Sheeran 244.653 47 2017-03-11 21:45:28+00:00 0.811000 0.0361 9 0.1300 ... 3 -10.910 0.573 0.3770 ['pop', 'uk pop'] pop A/La 2017 3 11

6 rows × 23 columns
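
The table above shows the same track saved twice under different ids (for example, Ed Sheeran's "She"). Before dropping anything, it can help to list every row that shares a name and artist with another row. This is a small sketch using `DataFrame.duplicated` with `keep=False`; the helper name `find_repeats` and the ids in the toy frame are made up.

```python
import pandas as pd

def find_repeats(df: pd.DataFrame, keys=("name", "artists")) -> pd.DataFrame:
    """Return all rows whose key columns also appear in at least one other row."""
    mask = df.duplicated(subset=list(keys), keep=False)  # flag every copy, not just later ones
    return df[mask].sort_values(list(keys))

# toy rows with hypothetical ids
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "name": ["She", "She", "She", "Golden"],
    "artists": ["Ed Sheeran", "Harry Styles", "Ed Sheeran", "Harry Styles"],
})
repeats = find_repeats(toy)
print(repeats)
```

Songs that share only a title (such as "She" by Harry Styles) are not flagged, because the artist differs.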

In [20]:
# Let's remove these duplicates. FIXME: this should actually go in preprocessing
liked_songs_nodup = liked_songs.drop_duplicates(
  subset = ['name', 'artists'],
  keep = 'first').reset_index(drop = True)
In [21]:
liked_songs_nodup.shape
Out[21]:
(4402, 23)
In [22]:
# I'm going to assign it back to liked_songs because liked_songs_nodup is long to type
liked_songs = liked_songs_nodup

Visualization¶

Now that we've done some initial exploring, we'll create some visuals.

Visualization of all the audio features¶

In [23]:
audio_features = liked_songs[['acousticness', 'speechiness', 'key', 'liveness',
                              'instrumentalness', 'energy', 'tempo', 'time_signature',
                              'loudness', 'danceability', 'valence']]
In [24]:
for col in audio_features:
        plt.figure(figsize=(20,2))
        sns.histplot(data=audio_features, x=col)
# change the color for each histplot?
In [25]:
for col in audio_features:
        plt.figure(figsize=(17,1))
        sns.boxplot(data=audio_features, x=col)
In [26]:
# find the top 6 genres in the dataset
liked_songs.genre.value_counts().head(6).axes
Out[26]:
[Index(['dance pop', 'pop', 'other', 'alt z', 'boy band', 'acoustic pop'], dtype='object')]
In [27]:
# keep only the rows whose genre is one of the six most common genres
top_genres = liked_songs.genre.value_counts().head(6).index
top_six_genres = liked_songs[liked_songs.genre.isin(top_genres)]
In [28]:
sns.boxplot(data=top_six_genres, x="acousticness", y="genre");

Okay, we can loop through and do this for all of the audio features if we want.

In [29]:
for col in audio_features:
    plt.figure(figsize=(21,2))
    sns.boxplot(data=top_six_genres, x=col, y="genre", palette="Spectral");
In [30]:
# running the violin plot on top 6 genres
sns.violinplot(x=top_six_genres["genre"], y=top_six_genres["popularity"])
Out[30]:
<AxesSubplot:xlabel='genre', ylabel='popularity'>

Heatmap for correlation between variables¶

In [31]:
# create heatmap for correlation
plt.figure(figsize=(15, 10))
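# restrict to numeric columns; newer pandas versions raise on object columns otherwise
corr = liked_songs.select_dtypes(include='number').corr()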
sns.heatmap(corr, annot=True, cmap="coolwarm").set_title('Pearson correlation matrix')
plt.show()
In [32]:
# printing out correlation table
corr
Out[32]:
duration_s popularity acousticness speechiness key liveness instrumentalness energy tempo time_signature loudness danceability valence year month day
duration_s 1.000000 0.031815 0.034446 -0.132096 -0.003346 0.005957 0.061642 -0.080983 -0.011011 -0.056028 -0.081418 -0.240539 -0.274287 -0.076360 0.030608 -0.015812
popularity 0.031815 1.000000 -0.171208 0.032150 -0.013396 -0.043031 -0.098013 0.147871 0.031145 0.019090 0.155916 0.107498 0.081278 0.098730 0.040239 0.057228
acousticness 0.034446 -0.171208 1.000000 -0.118135 -0.013288 -0.102350 0.241087 -0.802025 -0.189376 -0.191801 -0.648510 -0.311816 -0.428647 -0.075358 -0.038523 -0.096997
speechiness -0.132096 0.032150 -0.118135 1.000000 0.043645 0.110887 -0.054260 0.140477 0.072261 0.045833 0.066252 0.194771 0.189573 -0.065093 0.100805 -0.000361
key -0.003346 -0.013396 -0.013288 0.043645 1.000000 0.022326 -0.030697 0.014709 0.019293 -0.002465 0.011437 0.031150 0.033060 0.003758 -0.011192 -0.024308
liveness 0.005957 -0.043031 -0.102350 0.110887 0.022326 1.000000 -0.053982 0.145014 0.037430 0.017698 0.092098 -0.013642 0.104309 -0.010005 -0.000381 0.012292
instrumentalness 0.061642 -0.098013 0.241087 -0.054260 -0.030697 -0.053982 1.000000 -0.263983 -0.073373 -0.088275 -0.549775 -0.272730 -0.169129 -0.055967 0.043692 -0.094861
energy -0.080983 0.147871 -0.802025 0.140477 0.014709 0.145014 -0.263983 1.000000 0.224378 0.213425 0.777324 0.269271 0.540316 0.041201 0.020076 0.117822
tempo -0.011011 0.031145 -0.189376 0.072261 0.019293 0.037430 -0.073373 0.224378 1.000000 -0.016496 0.177407 -0.092420 0.102179 0.034660 0.005372 0.029178
time_signature -0.056028 0.019090 -0.191801 0.045833 -0.002465 0.017698 -0.088275 0.213425 -0.016496 1.000000 0.185973 0.197015 0.161807 -0.000262 0.018368 0.037414
loudness -0.081418 0.155916 -0.648510 0.066252 0.011437 0.092098 -0.549775 0.777324 0.177407 0.185973 1.000000 0.325456 0.411620 0.046350 -0.001727 0.124057
danceability -0.240539 0.107498 -0.311816 0.194771 0.031150 -0.013642 -0.272730 0.269271 -0.092420 0.197015 0.325456 1.000000 0.503276 0.036595 0.034106 0.017247
valence -0.274287 0.081278 -0.428647 0.189573 0.033060 0.104309 -0.169129 0.540316 0.102179 0.161807 0.411620 0.503276 1.000000 0.051700 0.000238 0.043327
year -0.076360 0.098730 -0.075358 -0.065093 0.003758 -0.010005 -0.055967 0.041201 0.034660 -0.000262 0.046350 0.036595 0.051700 1.000000 -0.167197 -0.074223
month 0.030608 0.040239 -0.038523 0.100805 -0.011192 -0.000381 0.043692 0.020076 0.005372 0.018368 -0.001727 0.034106 0.000238 -0.167197 1.000000 0.102382
day -0.015812 0.057228 -0.096997 -0.000361 -0.024308 0.012292 -0.094861 0.117822 0.029178 0.037414 0.124057 0.017247 0.043327 -0.074223 0.102382 1.000000
In [33]:
corr.min()
Out[33]:
duration_s         -0.274287
popularity         -0.171208
acousticness       -0.802025
speechiness        -0.132096
key                -0.030697
liveness           -0.102350
instrumentalness   -0.549775
energy             -0.802025
tempo              -0.189376
time_signature     -0.191801
loudness           -0.648510
danceability       -0.311816
valence            -0.428647
year               -0.167197
month              -0.167197
day                -0.096997
dtype: float64
In [34]:
corr.max()
Out[34]:
duration_s          1.0
popularity          1.0
acousticness        1.0
speechiness         1.0
key                 1.0
liveness            1.0
instrumentalness    1.0
energy              1.0
tempo               1.0
time_signature      1.0
loudness            1.0
danceability        1.0
valence             1.0
year                1.0
month               1.0
day                 1.0
dtype: float64

Correlation values range from -1 to 1. The closer the absolute value is to 1, the stronger the linear relationship between the two variables. Variables with a strong positive correlation increase together, whereas with a strong negative correlation one goes down as the other goes up. The closer the value is to 0, the weaker the correlation.

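  • variables will always have a correlation of 1 with themselves
  • we can see the strongest correlation is between energy and acousticness at -0.80, followed by energy and loudness at 0.78
  • the weakest correlation is between valence and month at 0.00024; speechiness and day is similarly negligible at -0.00036

Overall, apart from the energy, acousticness, and loudness group, most of our variables are only weakly correlated.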
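
Reading the strongest and weakest pairs off a 16 x 16 table by eye is error-prone, so here is one way it could be done programmatically. It is a sketch: the helper name `rank_pairs` is made up, and the toy matrix holds three values copied from the correlation table above.

```python
import numpy as np
import pandas as pd

def rank_pairs(corr: pd.DataFrame) -> pd.Series:
    """Return each unordered column pair once, strongest absolute correlation first."""
    # keep only the cells strictly above the diagonal, so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

names = ["energy", "acousticness", "loudness"]
toy = pd.DataFrame(
    [[1.0, -0.802025, 0.777324],
     [-0.802025, 1.0, -0.648510],
     [0.777324, -0.648510, 1.0]],
    index=names, columns=names,
)
ranked = rank_pairs(toy)
print(ranked)
```

With the full matrix, `rank_pairs(corr).head()` would list the strongest pairs and `.tail()` the weakest.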

In [35]:
# we can also create a pairplot of the pairwise relationships
sns.pairplot(liked_songs);

Audio Features

In [36]:
audio_descp = liked_songs.copy()
audio_descp = audio_descp.drop(columns=['year', 'month', 'day'])

The pairplot above is really crowded because it was run on all of the numeric columns in our dataset. It shows the pairwise relationship between columns, so it's similar to the heatmap, but instead of a correlation value each panel is a scatter plot of one column against another.

In [37]:
# pairplot again, this time without the date columns
sns.pairplot(audio_descp)
Out[37]:
<seaborn.axisgrid.PairGrid at 0x255a391b3d0>
In [38]:
# find the closest corr for each col
corr.columns
Out[38]:
Index(['duration_s', 'popularity', 'acousticness', 'speechiness', 'key',
       'liveness', 'instrumentalness', 'energy', 'tempo', 'time_signature',
       'loudness', 'danceability', 'valence', 'year', 'month', 'day'],
      dtype='object')
In [39]:
corr.iloc[:,0] # everything for first col
Out[39]:
duration_s          1.000000
popularity          0.031815
acousticness        0.034446
speechiness        -0.132096
key                -0.003346
liveness            0.005957
instrumentalness    0.061642
energy             -0.080983
tempo              -0.011011
time_signature     -0.056028
loudness           -0.081418
danceability       -0.240539
valence            -0.274287
year               -0.076360
month               0.030608
day                -0.015812
Name: duration_s, dtype: float64
In [40]:
# plan: get the max corr for each column
# plot the max corr
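
One possible way to carry out the plan above, as a sketch rather than a finished cell: hide the diagonal, then take the column with the largest absolute correlation for each variable. The helper name `strongest_partner` is made up, and the toy matrix reuses three values from the correlation table.

```python
import numpy as np
import pandas as pd

def strongest_partner(corr: pd.DataFrame) -> pd.DataFrame:
    """For each column, return the other column it correlates with most strongly."""
    # hide the diagonal so a column cannot be matched with itself
    masked = corr.mask(np.eye(len(corr), dtype=bool))
    partner = masked.abs().idxmax()
    values = [masked.loc[p, c] for c, p in partner.items()]
    return pd.DataFrame({"partner": partner, "corr": values})

names = ["energy", "acousticness", "loudness"]
toy = pd.DataFrame(
    [[1.0, -0.802025, 0.777324],
     [-0.802025, 1.0, -0.648510],
     [0.777324, -0.648510, 1.0]],
    index=names, columns=names,
)
result = strongest_partner(toy)
print(result)
```

The resulting `corr` column keeps its sign, so it could be passed straight to a bar plot to show both the strength and the direction of each variable's closest relationship.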
In [41]:
# create a histogram of the popularity of the tracks
sns.histplot(data=liked_songs, x="popularity", bins=25);

This is a visual confirmation of how popular my music taste is: most of my songs fall in the mid-range of popularity.

In [42]:
# mean popularity score value
liked_songs.popularity.mean()
Out[42]:
45.02317128577919

What about the popularity in 2022?

In [43]:
liked_songs[liked_songs.year == 2022].popularity.mean()
Out[43]:
51.80294450736127

It'll be interesting to see how the popularity of my songs has changed over the years.
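
A quick way to look at that, sketched on toy numbers: group by the year a song was added and summarize popularity. The helper name `popularity_by_year` and the toy values are made up for illustration.

```python
import pandas as pd

def popularity_by_year(df: pd.DataFrame) -> pd.DataFrame:
    """Count, mean, and median popularity for each year a song was added."""
    return (
        df.groupby("year")["popularity"]
        .agg(["count", "mean", "median"])
        .sort_index()
    )

# toy values, NOT real data
toy = pd.DataFrame({
    "year": [2020, 2020, 2022, 2022, 2022],
    "popularity": [30, 50, 40, 60, 80],
})
by_year = popularity_by_year(toy)
print(by_year)
```

Keeping the count next to the mean matters here, since a year like 2018 has only a handful of songs and its average would be very noisy.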

In [44]:
# histogram of duration in seconds
sns.histplot(liked_songs.duration_s, bins=25);
In [45]:
(liked_songs.duration_s.mean())/60
Out[45]:
3.524018279569893

On average, my saved songs are about 3.5 minutes long.

In [46]:
# longest song in my saved (idxmax returns an index label, so use .loc)
liked_songs.loc[liked_songs['duration_s'].idxmax(), ['artists', 'name']]
Out[46]:
artists    Garth Stevenson
name                  Dawn
Name: 4154, dtype: object
In [47]:
# shortest song in my saved (idxmin returns an index label, so use .loc)
liked_songs.loc[liked_songs['duration_s'].idxmin(), ['artists', 'name']]
Out[47]:
artists       Bo Burnham
name       Unpaid Intern
Name: 1286, dtype: object
In [48]:
liked_songs_2022 = liked_songs[liked_songs.year == 2022]
print(f"{len(liked_songs_2022)} songs have been added in 2022")
883 songs have been added in 2022
In [49]:
liked_songs_2022.head(5)
Out[49]:
id name artists duration_s popularity added_at acousticness speechiness key liveness ... time_signature loudness danceability valence genre_list genre pitch_class year month day
0 3OHfY25tqY28d16oZczHc8 Kill Bill SZA 153.946 95 2022-12-31 18:17:00+00:00 0.054300 0.0351 8 0.1610 ... 4 -5.750 0.644 0.430 ['pop', 'r&b'] pop G#/Sol sost. 2022 12 31
1 2VFetGqLYq0Pc8ZtRYCaeL In The Kitchen Reneé Rapp 227.749 73 2022-12-31 08:05:02+00:00 0.833000 0.0504 0 0.2640 ... 3 -6.578 0.364 0.125 ['alt z', 'indie pop', 'pop'] alt z C/Do 2022 12 31
2 2nZq5WQOW4FEPxCVTdNGfB Kissin' In The Cold JP Saxe 224.083 73 2022-12-26 20:38:34+00:00 0.777000 0.0360 0 0.1620 ... 4 -7.620 0.528 0.342 ['alt z', 'canadian contemporary r&b', 'neo me... alt z C/Do 2022 12 26
3 27AHAtAirQapVldIm4c9ZX Jump Kris Kross 195.106 67 2022-12-24 01:28:06+00:00 0.000501 0.1640 2 0.3330 ... 4 -10.625 0.749 0.481 ['hip hop', 'old school atlanta hip hop', 'rap'] hip hop D/Re 2022 12 24
4 7fldJ1m6Ol6tbOIzA9y9zg Grow Up Olly Murs 225.173 49 2022-12-22 15:09:12+00:00 0.255000 0.0546 8 0.0906 ... 4 -3.938 0.571 0.483 ['dance pop', 'pop', 'pop dance', 'post-teen p... dance pop G#/Sol sost. 2022 12 22

5 rows × 23 columns

In [50]:
sns.scatterplot(data=liked_songs_2022, x='energy', y='acousticness');

Above is a visualization of the relationship we saw earlier in the heatmap. As energy increases, acousticness tends to decrease. We can fit a linear regression to make the pattern even clearer.

In [51]:
sns.regplot(data=liked_songs_2022, x='energy', y='acousticness');
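
The regression line is drawn on the plot, but its numbers are not shown. A minimal sketch for recovering the slope, intercept, and Pearson r with NumPy is below; the helper name `fit_line` is made up and the toy points lie on an exact line purely for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line y = slope * x + intercept, plus the Pearson r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # degree-1 polynomial = straight line
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r

# toy points on an exact line, NOT real data
toy_energy = [0.2, 0.4, 0.6, 0.8]
toy_acousticness = [0.9, 0.7, 0.5, 0.3]
slope, intercept, r = fit_line(toy_energy, toy_acousticness)
print(slope, intercept, r)
```

On the real data the call would be `fit_line(liked_songs_2022['energy'], liked_songs_2022['acousticness'])`, and a negative slope would match the downward trend in the scatterplot.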

How have the audio features of my song choices changed over time? We can write a function that plots a rolling average of a column over time.

In [52]:
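def plot_time_series(col_name, title, rolling_window_days):
    # index by the parsed timestamp so the window can be expressed in days
    series = pd.Series(data=liked_songs[col_name].to_numpy(),
                       name=col_name,
                       index=pd.to_datetime(liked_songs['added_at'])).sort_index()

    # an offset window such as '30D' averages over calendar days;
    # a bare integer would average over that many rows (songs) instead
    (series.rolling(window=f'{rolling_window_days}D')
     .mean()
     .plot(figsize=(30, 10))
     .set(xlabel='date added', ylabel=col_name, title=title))

    plt.show()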
In [53]:
plot_time_series('popularity', 'Popularity over time (window = 30 days)', 30)
plot_time_series('duration_s', 'Duration (s) over time (window = 30 days)', 30)
plot_time_series('danceability', 'Danceability over time (window = 30 days)', 30)
plot_time_series('valence', 'Valence over time (window = 30 days)', 30)
plot_time_series('energy', 'Energy over time (window = 30 days)', 30)
plot_time_series('tempo', 'Tempo over time (window = 30 days)', 30)

In general, our plots above are fairly stationary from 2020 on.
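
One thing to keep in mind about rolling windows in pandas: an integer window counts rows (here, songs), whereas an offset string such as '30D' counts calendar time and needs a datetime index. The toy series below, with made-up values, shows how different the two can be when songs are added in bursts.

```python
import pandas as pd

# three toy observations: two on consecutive days, one two months later
ts = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-03-01"], utc=True)
s = pd.Series([10.0, 20.0, 60.0], index=ts)

by_rows = s.rolling(window=2).mean()      # average of the last 2 rows
by_days = s.rolling(window="30D").mean()  # average of rows in the last 30 days

print(pd.DataFrame({"value": s, "by_rows": by_rows, "by_days": by_days}))
```

For the March row, the row-based window still averages in the January value, while the 30-day window only sees the March value. The row-based window also starts with a missing value because it waits for two rows by default.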

In [54]:
# plt.figure(figsize=(15, 10))
sns.boxplot(x=liked_songs['valence']).set_title('Tracks valence (1 = happy, 0 = sad)')
plt.show()
In [55]:
# plt.figure(figsize=(15, 10))
sns.boxplot(x=liked_songs['loudness']).set_title('Loudness')
plt.show()
In [56]:
# plt.figure(figsize=(15, 10))
sns.boxplot(x=liked_songs['tempo']).set_title('Tempo')
plt.show()
In [57]:
# plt.figure(figsize=(15, 10))
sns.boxplot(x=liked_songs['energy']).set_title('Energy')
plt.show()
In [58]:
# !pip install yellowbrick
from yellowbrick.target import FeatureCorrelation

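# define columns to select ('year' is left out: it is constant within the 2022 subset,
# so its correlation with popularity is undefined)
feature_names = ['acousticness', 'danceability', 'energy', 'instrumentalness',
       'liveness', 'loudness', 'speechiness', 'tempo', 'valence','duration_s','key']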

X, y = liked_songs_2022[feature_names], liked_songs_2022['popularity']

# Create a list of the feature names
features = np.array(feature_names)

# Instantiate the visualizer
visualizer = FeatureCorrelation(labels=features)

plt.rcParams['figure.figsize']=(6,6)
visualizer.fit(X, y)     # Fit the data to the visualizer
visualizer.show();

Make this a graph instead

Year Number of songs
2020 2112
2022 907
2021 842
2019 711
2017 208
2018 6

Taylor Swift is the artist I have added the greatest number of songs for over the years. How did my addition of Taylor Swift songs change throughout the years, or did it? Note: We'll want to pay attention to album releases, since she took a brief hiatus and I will often add her songs immediately when an album is released. It might also be interesting to see if I add the songs the same year they are released or later on.

In [59]:
liked_songs['year'] = pd.DatetimeIndex(liked_songs['added_at']).year
liked_songs['month'] = pd.DatetimeIndex(liked_songs['added_at']).month
liked_songs['day'] = pd.DatetimeIndex(liked_songs['added_at']).day
In [60]:
taylor_swift = liked_songs[liked_songs.artists == 'Taylor Swift']
taylor_swift
Out[60]:
id name artists duration_s popularity added_at acousticness speechiness key liveness ... time_signature loudness danceability valence genre_list genre pitch_class year month day
32 3UMrglJeju5yWyYIW6o99b The Great War Taylor Swift 240.355 85 2022-11-01 05:13:52+00:00 0.21900 0.0353 5 0.0842 ... 4 -8.987 0.573 0.554 ['pop'] pop F/Fa 2022 11 1
38 0A1JLUlkZkp2EFrosoNQi0 Labyrinth Taylor Swift 247.962 81 2022-10-22 04:42:40+00:00 0.78500 0.0517 0 0.1220 ... 4 -15.480 0.406 0.122 ['pop'] pop C/Do 2022 10 22
39 3qoftcUZaUOncvIYjFSPdE Bejeweled Taylor Swift 194.165 86 2022-10-22 04:41:51+00:00 0.06180 0.0693 7 0.0887 ... 4 -9.190 0.696 0.433 ['pop'] pop G/Sol 2022 10 22
40 1xwAWUI6Dj0WGC3KiUPN0O Vigilante Shit Taylor Swift 164.801 84 2022-10-22 04:41:17+00:00 0.17300 0.3900 4 0.1210 ... 4 -11.096 0.798 0.163 ['pop'] pop E/Mi 2022 10 22
41 3rWDp9tBPQR9z6U5YyRSK4 Midnight Rain Taylor Swift 174.782 89 2022-10-22 04:40:42+00:00 0.69000 0.0767 0 0.1150 ... 4 -11.738 0.643 0.230 ['pop'] pop C/Do 2022 10 22
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3665 0xvsgzM8AtBtRHZm5rav8A The Other Side Of The Door Taylor Swift 237.600 43 2019-12-30 00:17:51+00:00 0.16900 0.0347 4 0.2380 ... 4 -4.279 0.476 0.469 ['pop'] pop E/Mi 2019 12 30
3666 7zMcNqs55Mxer82bvZFkpg Teardrops On My Guitar - Radio Single Remix Taylor Swift 203.040 66 2019-12-30 00:17:25+00:00 0.28800 0.0231 10 0.1190 ... 4 -6.941 0.621 0.289 ['pop'] pop A#/La sost. 2019 12 30
3667 4pJi1rVt9GNegU9kywjg4z Teardrops on My Guitar - Pop Version Taylor Swift 179.066 53 2019-12-30 00:17:25+00:00 0.04020 0.0537 10 0.0863 ... 4 -3.827 0.459 0.483 ['pop'] pop A#/La sost. 2019 12 30
3728 1dGr1c8CrMLDpV6mPbImSI Lover Taylor Swift 221.306 85 2019-12-28 00:19:55+00:00 0.49200 0.0919 7 0.1180 ... 4 -7.582 0.359 0.453 ['pop'] pop G/Sol 2019 12 28
3907 6RRNNciQGZEXnqk8SQ9yv5 You Need To Calm Down Taylor Swift 171.360 77 2019-12-27 23:52:36+00:00 0.00929 0.0553 2 0.0637 ... 4 -5.617 0.771 0.714 ['pop'] pop D/Re 2019 12 27

206 rows × 23 columns

In [61]:
sns.countplot(data=taylor_swift, y="year",hue='pitch_class');
In [62]:
sns.countplot(data=taylor_swift, y="year",hue='month');
In [63]:
sns.countplot(data=liked_songs, x="month", hue='year');
In [64]:
# compare the number of songs added each month, with one line per year
df = liked_songs.groupby(['year', 'month']).size().reset_index(name='count')
sns.lineplot(data=df, x="month", y="count", hue="year", palette="Paired");

Maybe use Altair instead to put these side by side per year

In [ ]:
t_swizzle = liked_songs[liked_songs.artists == 'Taylor Swift']
In [ ]:
liked_songs.columns
In [ ]:
sns.scatterplot(data=t_swizzle, x='year', y='popularity');

plt.xticks(rotation=85);

Change the above to a floating boxplot

In [ ]:
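# both columns are numeric, so state the orientation explicitly: one horizontal box per year
sns.boxplot(data=t_swizzle, y='year', x='popularity', orient='h');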

plt.xticks(rotation=90);

T-Swizzle¶

In [ ]:
taylor_swift = liked_songs[liked_songs.artists =='Taylor Swift']
In [ ]:
taylor_swift.columns
In [ ]:
taylor_swift.key.unique()
In [ ]:
sns.histplot(data=taylor_swift, x="popularity", hue='key');

Should I compare the data for my top artist this year vs last year?

Most of TSwizzle's songs have a popularity of around 65-70.

I want to see the most and least popular song (TSwizzle and otherwise) that I added each year.
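
A sketch of how the most and least popular song per year could be pulled out with `groupby` plus `idxmax`/`idxmin`. The helper name `extremes_per_year` is made up, the four toy rows are copied from tables shown earlier in this notebook, and the approach assumes the frame has a unique index.

```python
import pandas as pd

def extremes_per_year(df: pd.DataFrame) -> pd.DataFrame:
    """Most and least popular track for each year (first match wins on ties)."""
    cols = ["year", "name", "artists", "popularity"]
    grouped = df.groupby("year")["popularity"]
    most = df.loc[grouped.idxmax(), cols].assign(which="most popular")
    least = df.loc[grouped.idxmin(), cols].assign(which="least popular")
    return (
        pd.concat([most, least])
        .sort_values(["year", "which"])
        .reset_index(drop=True)
    )

toy = pd.DataFrame({
    "year": [2022, 2022, 2020, 2020],
    "name": ["Kill Bill", "Grow Up", "She", "She"],
    "artists": ["SZA", "Olly Murs", "Harry Styles", "Meyru"],
    "popularity": [95, 49, 75, 0],
})
extremes = extremes_per_year(toy)
print(extremes)
```

To restrict this to one artist, the same function could be called on a filtered frame such as `taylor_swift`.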

Further visualization:

  1. which months I tend to add the most songs in
  2. compare my songs to spotify's top songs chart
  3. map each audio feature to a color scale and create a picture or hue based on my artist and track preferences <-- audio aura
  • add release date year for each track
In [ ]:
# getting the most popular tracks in the dataset
liked_songs.sort_values(by=['popularity'], ascending=False)[['name', 'artists']].head(20)
In [ ]:
# top 20 genres by count
top_20c = liked_songs['genre'].value_counts().head(20).reset_index()
# assign the column names directly; set_axis(..., inplace=True) is not available in newer pandas
top_20c.columns = ['genre', 'count']
sns.barplot(data=top_20c, y='genre', x='count').set(title='Number of Tracks By Genre (Top 20)');
In [ ]:
# popularity of the songs
# get the average popularity of the genres
# select popularity before averaging so non-numeric columns are never passed to mean()
top_20 = (liked_songs.groupby('genre')['popularity'].mean()
          .sort_values(ascending=False).head(20).reset_index())
sns.barplot(data=top_20, y='genre', x='popularity').set(title='Popularity of Tracks By Genre (Top 20)');

Bivariate KDE¶

In [ ]:
# jointplot builds its own figure, so its size is set with `height` rather than figure.figsize
sns.jointplot(data=liked_songs, x="loudness", y="energy", kind="kde", height=20);
In [ ]:
!jupyter nbconvert eda.ipynb --to html --output eda_code.html